Scenario: How image captioning helps a business

A business scenario in news and media:

A news agency publishes hundreds of articles daily on its website. Each article contains several images relevant to the story. Writing appropriate and descriptive captions for each image manually is a tedious task and might slow down the publication process.

In this scenario, your image captioning program can expedite the process:

  1. Journalists write their articles and select relevant images to go along with the story.

  2. These images are then fed into the image captioning program (instead of manually writing a description for each image).

  3. The program processes these images and generates a text file with the suggested captions for each image.

  4. The journalists or editors review these captions. They might use them as they are, or they might modify them to better fit the context of the article.

  5. These approved captions then serve a dual purpose:

    • Enhanced accessibility: The captions are integrated as alternative text (alt text) for the images in the online article. Visually impaired users, using screen readers, can understand the context of the images through these descriptions, giving them a content consumption experience similar to that of sighted users and adhering to the principles of inclusive, accessible design.

    • Improved SEO: Properly captioned images with relevant keywords improve the article's SEO. Search engines like Google consider alt text while indexing, and this helps the article to appear in relevant search results, thereby driving organic traffic to the agency's website. This is especially useful for image search results.

  6. Once the captions are approved, they are added to the images in the online article.

By integrating this process, the agency not only expedites its publication process but also ensures all images come with appropriate descriptions, enhancing the accessibility for visually impaired readers, and improving the website's SEO. This way, the agency broadens its reach and engagement with a more diverse audience base.
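The alt-text integration described in step 5 can be sketched in code. The snippet below is a minimal illustration, not part of the agency's actual pipeline: the article HTML, image URL, and caption are all hypothetical. It shows how approved captions could be written into each image's alt attribute with BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Hypothetical mapping of image URLs to editor-approved captions
approved_captions = {
    "https://example.com/photo.jpg": "A reporter interviewing a local resident",
}

# Hypothetical article HTML with an image that lacks alt text
html = '<article><img src="https://example.com/photo.jpg"></article>'
soup = BeautifulSoup(html, "html.parser")

# Copy each approved caption into the matching image's alt attribute
for img in soup.find_all("img"):
    caption = approved_captions.get(img.get("src"))
    if caption:
        img["alt"] = caption

updated_html = str(soup)
```

In practice a CMS would handle this step, but the idea is the same: the approved caption becomes the alt attribute that screen readers and search engines consume.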

Let's implement an automated image captioning tool

In this section, you implement an automated image captioning program that works directly from a URL. The user provides the URL, and the code generates captions for the images found on the webpage. The output is a text file that lists each image URL along with its generated caption. To accomplish this, you use BeautifulSoup to parse the HTML content of the page and extract the image URLs.

Let's get started:

First, you send an HTTP request to the provided URL and retrieve the webpage's content. BeautifulSoup then parses this content, creating a parse tree from the page's HTML. You look for 'img' tags in this tree, as they contain the links to the images hosted on the webpage.

```python
import requests
from bs4 import BeautifulSoup

# URL of the page to scrape
url = "https://en.wikipedia.org/wiki/IBM"

# Download the page
response = requests.get(url)

# Parse the page with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
```

After extracting these URLs, you iterate through each of them, sending another HTTP request to download the corresponding image data.

It's important to note that this operation is performed synchronously in your current implementation. That means each image is downloaded one at a time, which could be slow for webpages with a large number of images. For a more efficient approach, one could explore asynchronous programming methods or the concurrent.futures library to download multiple images simultaneously.

```python
# Find all img elements
img_elements = soup.find_all('img')

# Iterate over each img element
for img_element in img_elements:
    ...
```
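The more efficient approach mentioned above can be sketched with concurrent.futures. The worker function, URLs, and thread count below are illustrative, not part of the original program, assuming each image is fetched independently:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

def download_image(img_url):
    # Fetch one image and return (url, bytes); return (url, None) on any failure
    try:
        response = requests.get(img_url, timeout=10)
        response.raise_for_status()
        return img_url, response.content
    except requests.RequestException:
        return img_url, None

# Illustrative URLs; in the real program these come from the img tags
img_urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]

# Download up to 8 images at a time instead of one after another
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(download_image, img_urls))
```

executor.map preserves input order, so each result can still be matched to its URL, and failed downloads surface as None instead of raising an exception mid-batch.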

Complete the code below to make it work:

Create a new Python file called automate_url_captioner.py and copy in the code below. Fill in the blank parts to make it work.

```python
import requests
from PIL import Image
from io import BytesIO
from bs4 import BeautifulSoup
from transformers import AutoProcessor, BlipForConditionalGeneration

# Load the pretrained processor and model
processor = # fill the pretrained model
model = # load the blip model

# URL of the page to scrape
url = "https://en.wikipedia.org/wiki/IBM"

# Download the page
response = requests.get(url)

# Parse the page with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Find all img elements
img_elements = soup.find_all('img')

# Open a file to write the captions
with open("captions.txt", "w") as caption_file:
    # Iterate over each img element
    for img_element in img_elements:
        img_url = img_element.get('src')

        # Skip if the image is an SVG or too small (likely an icon)
        if 'svg' in img_url or '1x1' in img_url:
            continue

        # Correct the URL if it's malformed
        if img_url.startswith('//'):
            img_url = 'https:' + img_url
        elif not img_url.startswith('http://') and not img_url.startswith('https://'):
            continue  # Skip URLs that don't start with http:// or https://

        try:
            # Download the image
            response = requests.get(img_url)
            # Convert the image data to a PIL Image
            raw_image = Image.open(BytesIO(response.content))
            if raw_image.size[0] * raw_image.size[1] < 400:  # Skip very small images
                continue
            raw_image = raw_image.convert('RGB')
            # Process the image
            inputs = processor(raw_image, return_tensors="pt")
            # Generate a caption for the image
            out = model.generate(**inputs, max_new_tokens=50)
            # Decode the generated tokens to text
            caption = processor.decode(out[0], skip_special_tokens=True)
            # Write the caption to the file, prepended by the image URL
            caption_file.write(f"{img_url}: {caption}\n")
        except Exception as e:
            print(f"Error processing image {img_url}: {e}")
            continue
```
Click here for the answer
```python
import requests
from PIL import Image
from io import BytesIO
from bs4 import BeautifulSoup
from transformers import AutoProcessor, BlipForConditionalGeneration

# Load the pretrained processor and model
processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# URL of the page to scrape
url = "https://en.wikipedia.org/wiki/IBM"

# Download the page
response = requests.get(url)

# Parse the page with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Find all img elements
img_elements = soup.find_all('img')

# Open a file to write the captions
with open("captions.txt", "w") as caption_file:
    # Iterate over each img element
    for img_element in img_elements:
        img_url = img_element.get('src')

        # Skip if the image is an SVG or too small (likely an icon)
        if 'svg' in img_url or '1x1' in img_url:
            continue

        # Correct the URL if it's malformed
        if img_url.startswith('//'):
            img_url = 'https:' + img_url
        elif not img_url.startswith('http://') and not img_url.startswith('https://'):
            continue  # Skip URLs that don't start with http:// or https://

        try:
            # Download the image
            response = requests.get(img_url)
            # Convert the image data to a PIL Image
            raw_image = Image.open(BytesIO(response.content))
            if raw_image.size[0] * raw_image.size[1] < 400:  # Skip very small images
                continue
            raw_image = raw_image.convert('RGB')
            # Process the image
            inputs = processor(raw_image, return_tensors="pt")
            # Generate a caption for the image
            out = model.generate(**inputs, max_new_tokens=50)
            # Decode the generated tokens to text
            caption = processor.decode(out[0], skip_special_tokens=True)
            # Write the caption to the file, prepended by the image URL
            caption_file.write(f"{img_url}: {caption}\n")
        except Exception as e:
            print(f"Error processing image {img_url}: {e}")
            continue
```

As output, you will have a new file named captions.txt in the explorer (same directory as your script).

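Each line of captions.txt pairs an image URL with its caption, separated by ": ". As a small sketch (the file contents below are made up for illustration), the file can be parsed back into (URL, caption) pairs:

```python
# Illustrative captions.txt content, matching the "<url>: <caption>" format
sample = (
    "https://example.com/logo.png: a blue corporate logo\n"
    "https://example.com/hq.jpg: a large office building\n"
)

with open("captions_demo.txt", "w") as f:
    f.write(sample)

pairs = []
with open("captions_demo.txt") as f:
    for line in f:
        # Split only on the first ": " so the "https://" in the URL stays intact
        url, caption = line.rstrip("\n").split(": ", 1)
        pairs.append((url, caption))
```

Splitting on the first ": " only is what keeps the URL's own colon (in "https://") from breaking the parse.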

Bonus: Image captioning for local files (Run locally if using Blip2)

With a few modifications, you can adapt the code to operate on local images. This involves utilizing the glob library to sift through all image files in a specific directory and then writing the generated captions to a text file.

Additionally, you can make use of the Blip2 model, a more powerful pretrained model for image captioning. In fact, you can easily swap in any new pretrained model as it becomes available, since these models are continuously improving. The example below demonstrates the Blip2 model; however, be aware that Blip2 requires about 10 GB of space, which prevents running it in the CloudIDE environment.

```python
import os
import glob
...

# Specify the directory where your images are
image_dir = "/path/to/your/images"
image_exts = ["jpg", "jpeg", "png"]  # specify the image file extensions to search for

# Open a file to write the captions
with open("captions.txt", "w") as caption_file:
    # Iterate over each image file in the directory
    for image_ext in image_exts:
        for img_path in glob.glob(os.path.join(image_dir, f"*.{image_ext}")):
            # Load your image
            raw_image = Image.open(img_path).convert('RGB')
            ...
```

Try to implement it yourself.

Click here to see the complete version
```python
import os
import glob
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration  # Blip2 models

# Load the pretrained processor and model
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Specify the directory where your images are
image_dir = "/path/to/your/images"
image_exts = ["jpg", "jpeg", "png"]  # specify the image file extensions to search for

# Open a file to write the captions
with open("captions.txt", "w") as caption_file:
    # Iterate over each image file in the directory
    for image_ext in image_exts:
        for img_path in glob.glob(os.path.join(image_dir, f"*.{image_ext}")):
            # Load your image
            raw_image = Image.open(img_path).convert('RGB')
            # You do not need a question for image captioning
            inputs = processor(raw_image, return_tensors="pt")
            # Generate a caption for the image
            out = model.generate(**inputs, max_new_tokens=50)
            # Decode the generated tokens to text
            caption = processor.decode(out[0], skip_special_tokens=True)
            # Write the caption to the file, prepended by the image file name
            caption_file.write(f"{os.path.basename(img_path)}: {caption}\n")
```